Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FS hooks proposal #43

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

FS hooks proposal #43

wants to merge 1 commit into from

Conversation

RaisinTen
Copy link
Contributor

This proposal documents a Loaders API-like API for providing hooks into the filesystem.

Signed-off-by: Darshan Sen [email protected]

cc @nodejs/single-executable @nodejs/loaders

This proposal documents a Loaders API-like API for providing hooks into the filesystem.
Copy link
Contributor

@arcanis arcanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really want to see this document explain:

  • What makes this proposal superior to patching in CJS / a node:fs resolver override in ESM (I know I've been very vocal about this lately, but the best way to shut me up is to tell me exactly why it's not good enough, and how this is better; at the moment this RFC is just "let's do X", but you don't explain why, or pit it against alternatives. I already know the reasons for my opinions, I'd like to see yours).

  • Why passing the VFS on the command line is better than registering them at runtime.

  • Whether you intend this API to ever support write operations, if yes how, if not why (I could easily imagine memory filesystems being a very useful use case; imo even if the implementation's first iteration doesn't include write operations, they should still be covered in the RFC, as they are a natural follow-up and inform the design).

  • What's the expected performance overhead of this API (it seems very inefficient at the moment, since checking a file or folder existence requires reading it).

  • How you split synchronous and asynchronous operations; you only covered sync calls (which works well with a huge memory buffer), but that could kill performances (if, instead, the files are loaded from another place; say, if we were to write a VFS that would simply mirror another folder).

  • How composition is supposed to work.

  • And various other features:

    • Symlinks
    • Realpath
    • File descriptors
    • fs.opendir
    • FileHandle
    • Read streams

@jviotti
Copy link
Member

jviotti commented Sep 22, 2022

Whether you intend this API to ever support write operations, if yes how, if not why (I could easily imagine memory filesystems being a very useful use case; imo even if the implementation's first iteration doesn't include write operations, they should still be covered in the RFC, as they are a natural follow-up and inform the design).

In the case of SEAs, VFS' are by definition read-only. I agree that there is room for supporting write operations if the VFS lives outside of the binary, which we can mention we would like to support later, but not worry about them for now. In any case, providing hooks for write functions in the style of loaders hooks seem to naturally support that too.

@arcanis
Copy link
Contributor

arcanis commented Sep 22, 2022

In any case, providing hooks for write functions in the style of loaders hooks seem to naturally support that too.

It really depends on what you see as the selling point of the API in this proposal compared to its alternatives. If "lean" is one, as I seem to understand, then removing all write operations (or moving them behind "let's not worry about them now") doesn't make for a fair comparison.

@jviotti
Copy link
Member

jviotti commented Sep 22, 2022

@RaisinTen I think this is on the right path. For many applications, correctly monkey-patching all Node.js involved modules in a way that is 100% compatible with Node.js (and that remains so after changes to Node.js) is a challenge.

What I would like to avoid (which you are already considering in your proposal) is asking people to write a hooks file that re-exposes EVERY fs function. That solves the problem of monkey-patching and ESM support, but still requires people to carefully keep their hooks implementation matching Node.js.

Instead, I think that the ideal fs hooks API would ask you to re-implement a small set of lower-level functions that the fs module uses to provide all the higher level functions based on them. This means that the main problem for coming up with a proposal is determining what this small set of functions is. For example:

  • Get a file description from a file based on its path
  • A low-level read and write operations
  • etc

@jviotti
Copy link
Member

jviotti commented Sep 22, 2022

@arcanis Not sure I follow your comment.

the selling point of the API in this proposal compared to its alternatives

What are all the alternatives in this case? I'm only aware of monkey-patching.

The idea goals of this hooks API, as I mentioned in #43 (comment) a moment ago, is to allow people to hook into fs, require, import and child_process without monkey-patching and without having to re-implement every single function (ideally just a small subset).

@arcanis
Copy link
Contributor

arcanis commented Sep 22, 2022

What are all the alternatives in this case? I'm only aware of monkey-patching.

That's the one I have in mind, yes.

The idea goals of this hooks API, as I mentioned in #43 (comment) a moment ago, is to allow people to hook into fs, require, import and child_process without monkey-patching and without having to re-implement every single function (ideally just a small subset).

You're talking about having to re-implement every single function when patching, but the API in this document only covers a very small surface of the fs API. If we only care about this surface, then even when patching you don't have to patch more than a handful of fs methods (readFileSync, statSync, readdirSync, and existsSync, on the top of my head?), making the problem you raise moot (you don't have to re-implement "every single function", just a small subset).

And if the surface is larger, then I have my doubts that we'll end up on a significantly smaller surface than fs - it includes a few helpers we may not care as much (recursive copy, recursive delete), but most of them still directly map to unix syscalls or primitives important to optimize (such as readFile, which would be quite less efficient if it was turned into an open/read/close call against the hooks).

@jviotti
Copy link
Member

jviotti commented Sep 23, 2022

but the API in this document only covers a very small surface of the fs API

Yeah, definitely. This is why I mentioned that this is just a start.

but most of them still directly map to unix syscalls or primitives important to optimize (such as readFile, which would be quite less efficient if it was turned into an open/read/close call against the hooks).

That is a good point, and the thing I want to explore further. Taking readFile as an example, what it seems to do is:

  • Open a file descriptor to the path
  • Read the file through its file descriptor

If we could hook into the lower-level binding.open() to return a file descriptor on the same file, seeked to the right position and with size information, then the rest of readFile might proceed as normal, without any additional code being changed.

I don't know if the same kind of argument holds for other functions in fs, but it sounds worth doing the exercise.

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be something we want to bootstrap/register from code too and operate in a mode where I can read from the VFS but also a regular file system.

Something like:

fs.registerVirtualFileSystem('zip://myfile', {
  ...
})

@GeoffreyBooth
Copy link
Member

  • What makes this proposal superior to patching in CJS / a node:fs resolver override in ESM (I know I’ve been very vocal about this lately, but the best way to shut me up is to tell me exactly why it’s not good enough, and how this is better; at the moment this RFC is just “let’s do X”, but you don’t explain why, or pit it against alternatives. I already know the reasons for my opinions, I’d like to see yours).

I think hooks that are untethered from the public APIs are in general superior to overriding public APIs because the hooks allow more flexibility. To take the Loaders API for example, we don’t simply override import or provide a hook named import; the hooks are resolve and load, for the two phases of importing, and loader authors can override one or both. Even more importantly, composition can happen horizontally, across several loaders’ resolve hooks, for example, before moving on to several chained load hooks.

In the case of readFile, for example, it was stated above that this is a combination of two lower-level operations: open and read. These are very similar conceptually to resolve and load, and I can imagine use cases were a library might want to customize open (to redirect to a different file path, for example) but not read. And then like with loaders and chained resolve hooks, multiple registered open hooks could be chained before the eventually opened file is read.

  • Why passing the VFS on the command line is better than registering them at runtime.

In general I’d like the registration of hooks to be possible from within user code. Something like:

import { registerHooks } from 'customization'

await registerHooks({
  fileOpen: () => { /* ... */ },
  fileRead: () => { /* ... */ },
  replSuggest: () => { /* ... */ },
})

await import('./entry.js')

Loaders might need to be a special case because they need to happen before any user code is even parsed, but for everything else we’ve been considering the registration needs to happen only before the relevant patched API is called. So if you have a virtual filesystem patch and you run your app like node --import virtual-filesystem/register app.js, the registration would happen before any app code calls any fs functions and so therefore it would happen early enough. Or you could register at the very start of the application before doing anything meaningful, as in my example above. I would think that we don’t need to create any new flags for customization hooks related to the file system, child processes, REPL or source maps.

Many of these questions are related to the design of a customization API in general, for any subsystem; so not just FS but child process and so on too. Hence in nodejs/loaders#95 the suggestion for making a new team and API for this, to use the Loaders API as a point of reference for making similar customization hooks available for other subsystems.

@arcanis
Copy link
Contributor

arcanis commented Sep 23, 2022

I think hooks that are untethered from the public APIs are in general superior to overriding public APIs because the hooks allow more flexibility. To take the Loaders API for example, we don’t simply override import or provide a hook named import; the hooks are resolve and load, for the two phases of importing, and loader authors can override one or both.

I'm not sure I subscribe to this analysis. The way the loaders API works is fine, but I don't see it as superior to a patching approach: I find them equivalent, nothing more. Additional differences with the present situation were also that:

  • Off-thread resolution made things more complex, and the necessity to define a set of resolver before starting the main evaluation would have made patching perhaps more confusing as an API

  • There just wasn't a good place to do this patching; the closest were Module._load and Module._resolveFilename, which were tied to the CJS resolution and mixed a lot of concerns; the same isn't true of fs, which generally maps one function to one filesystem operation

Finally, loaders are also primarily intended for ESM, and not having to deal with CJS compatibility gave them some leeway. Whereas whatever system we pick must be retro-compatible with the existing VFS layers - that currently work through patching.

In the case of readFile, for example, it was stated above that this is a combination of two lower-level operations: open and read. These are very similar conceptually to resolve and load, and I can imagine use cases were a library might want to customize open (to redirect to a different file path, for example) but not read.

And what do I do if I want to customize the whole readFile, in such a way that I don't have to keep file descriptors open, for any period of time, because it'd be a wasteful and potentially slow process? Especially with readFile being such an important function copiously used by almost all JS tools where performance matters: linters, bundlers, typecheckers, resolvers, ...

@arcanis
Copy link
Contributor

arcanis commented Sep 23, 2022

A quick note however: I agree that there's some "overhead" in fs that's annoying to implement (in particular parameters validation or normalization on variadic functions, such as read). I'm not saying fs is perfect today.

My point is mostly that, before reinventing a whole new API, before thinking about destructuring operations into file descriptors, I really think it'd make more sense to think about the small, incremental, improvements we can make to the existing APIs, which ones we can't, and see if we really need anything more after that.

I'd be really too bad to be 90% of the way, and reset all progress because we decided to try again from scratch without at least evaluating the deficiencies of the current system and why it can't be solved.

@Qard
Copy link
Member

Qard commented Sep 23, 2022

If we want to consider creating a fs hooks API in core we should probably also be carefully considering the possible impact it could have if we at some point decide to implement the FileSystem Web API.

Also, it's worth mentioning that many use cases for hooks could be served by just adding diagnostics_channel events to fs.

@mcollina
Copy link
Member

mcollina commented Sep 24, 2022

Unfortunately while monkey patching might be functionally equivalent it makes Node.js significantly harder to maintain and evolve.

There are quite a lot of modules in the ecosystem that monkeypatch node core behavior. Unfortunately this makes Node.js significantly harder to maintain. We had quite a few cases in the past where those modules became unmaintained and it made it significantly harder to improve core, and we had to step in ourselves.

Every module that monkeypatches core makes core harder to maintain long term.

@Flarna
Copy link
Member

Flarna commented Sep 24, 2022

I think both monkeypatching fs and replacing fs via loaders/hooks results in a simialar burden. Everyone expects that the replacement is as good/bad as the original therefore any change in the original could be seen as breaking just because something changed.

Another point to consider here is likely that there might be more VFS in place at the same time:

  • package managers might use it to support zipped modules
  • some other framework might use it to store it's resources efficient on disk
  • there might be serveral such frameworks in a single application

Therefore we should ensure that to create a solution working for multiple users where one is usualy not aware of the other.

@Qard
Copy link
Member

Qard commented Sep 24, 2022

Monkey-patching is much worse in the long term as it tends to involve touching internals and unspecified behaviour which is likely to change in subtle ways over time. Having worked in APM for about a decade and dealt with these changes, I know it can be quite a huge source of pain identifying when something goes wrong.

This is a big part of why I made diagnostics_channel. It not only provides a patch-less method of observing things but also, and in my opinion more importantly, it defines a clear contract of what can be observed which allows that surface area to be considered in how we express semver changes. This is very important for APMs to be actually stable as there has been many cases over the years where instrumentation of Node.js core broke in non-major releases and left us scrambling to fix the issue, putting our own regular work on hold unexpectedly.

It's also worth pointing out the havoc that fs-extras is known for inflicting on the environment due to its patching. I would strongly advise against endorsing monkey-patching as an acceptable way to achieve something like this, especially if we want it changing behaviour in new ways like adding access control. Such behavioural differences will likely break expectations of other patches and that will just lead to bad times for all. 😬

@Flarna
Copy link
Member

Flarna commented Sep 24, 2022

I'm not advocating for monkey patching - I agree that is a pain. But I fear that using loader hooks to replace fs is by no means better. In the end both variants do a global modification to reach the goal to add an isolated functionality.

I fully agree that diagnostics channel is good to avoid monkey patching for monitoring. DC is intended for monitoring but here the target is changed functionality.

I know that by passing muteable objects listners can reach side effects (expected or not for publisher). But I doubt it is a good solution if a bunch of (unsorted) listeners start tinkering on these monitored objects.

File systems are usually handled by OS and the complexity there is high for good reason. I doubt we should add the same in node. Therefore I think it's important to focus on which target should be archieved.
is SEA the only target or do we aim to support multiple VFSs in parallel?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants